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Abstract 

Next-generation sequencing technologies provide a revolutionary tool for generating gene expres- 
sion data. Starting with a fixed RNA sample, they construct a library of millions of differentially 
abundant short sequence tags or "reads" , which constitute a fundamentally discrete measure of the 
level of gene expression. A common limitation in experiments using these technologies is the low 
number or even absence of biological replicates, which complicates the statistical analysis of digital 
gene expression data. Analysis of this type of data has often been based on modified tests originally 
devised for analysing microarrays; both these and even de novo methods for the analysis of RNA-seq 
data are plagued by the common problem of low replication. 

We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression 
data. We begin with a hierarchical model for modelling over-dispersed count data and a blocked 
Gibbs sampling algorithm for inferring the posterior distribution of model parameters conditional 
on these counts. The algorithm compensates for the problem of low numbers of biological replicates 
by clustering together genes with tag counts that are likely sampled from a common distribution 
and using this augmented sample for estimating the parameters of this distribution. The number of 
clusters is not decided a prion, but it is inferred along with the remaining model parameters. We 
demonstrate the ability of this approach to model biological data with high fidelity by applying the 
algorithm on a public dataset obtained from cancerous and non-cancerous neural tissuesj^ 

1 Introduction 

It is a common truth that our knowledge in Molecular Biology is only as good as the tools we have 
at our disposal. Next-generation or high-throughput sequencing technologies provide a revolutionary 
tool in the aid of genomic studies by allowing the generation, in a relatively short time, of millions of 
short sequence tags, which reflect particular aspects of the molecular state of a biological system. A 
common application of these technologies is the study of the transcriptome, which involves a family of 
methodologies, including RNA-seq ([2S]), CAGE (Cap Analysis of Gene Expression; [TS]) and SAGE 
(Serial Analysis of Gene Expression; [H]). When compared to microarrays, this class of methodologies 
offers several advantages, including detection of a wider level of expression levels and independence on 
prior knowledge of the biological system, which is required by hybridisation-based technologies, such as 
microarrays. 

Typically, an experiment in this category starts with the extraction of a snapshot RNA sample from 
the biological system of interest and its shearing in a large number of fragments of varying lengths. The 
population of these fragments is then reversed-transcribed to a cDNA library and sequenced on a high- 
throughput platform, generating large numbers of short DNA sequences known as "reads" . The ensuing 
analysis pipeline starts with mapping or aligning these reads on a reference genome. At the next stage, 
the mapped reads are summarised into gene-, exon- or transcript-level counts, normalised and further 
analysed for detecting differential gene expression (see [TS] for a review). 

It is important to realize that the normalised read (or tag) count data generated from this family of 
methodologies represents the number of times a particular class of cDNA fragments has been sequenced, 
which is directly related to their abundance in the library and, in turn, the abundance of the associated 
transcripts in the original sample. Thus, this count data is essentially a discrete or digital measure of 
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gene expression, which is fundamentally different in nature (and, in general terms, superior in quality) 
from the continuous fluorescence intensity measurements obtained from the application of microarray 
technologies. Due to their better quality, next-generation sequence assays tend to replace microarray- 
based technologies, despite their higher cost ( ,4j). 

One approach for the analysis of count data of gene expression is to transform the counts to approx- 
imate normality and then apply existing methods aimed at the analysis of microarrays (see for example, 
[HTJ[S]). However, as noted in [T^], this approach may fail in the case of very small counts (which are far 
from normally distributed) and also due to the strong mean-variance relationship of count data, which 
is not taken into account by tests based on a normality assumption. Proper statistical modelling and 
analysis of count data of gene expression requires novel approaches, rather than adaptation of existing 
methodologies, which aimed from the beginning at processing continuous input. 

Formally, the generation of count data using next-generation sequencing assays can be thought of 
as random sampling of an underlying population of cDNA fragments. Thus, the counts for each tag 
describing a class of cDNA fragments can, in principle, be modelled using the Poisson distribution, 
whose variance is, by definition, equal to its mean. However, it has been shown that, in real count 
data of gene expression, the variance can be larger that what is predicted by the Poisson distribution 
( [m [T7| HH HI] ) ■ An approach that accounts for the so-called "over-dispersion" in the data is to adopt 
quasi-likelihood methods, which augment the variance of the Poisson distribution with a scaling factor, 
thus by-passing the assumption of equality between the mean and variance ( [21201 [211 E]). An alternative 
approach is to use the Negative Binomial distribution, which is derived from the Poisson, assuming a 
Gamma-distributed rate parameter. The Negative Binomial distribution incorporates both a mean and a 
variance parameter, thus modelling over-dispersion in a natural way ([Il[71[T3j). An overview of existing 
methods for the analysis of gene expression count data can be found in [TS] and [TD] 

Despite the decreasing cost of next-generation sequencing assays (and also due to technical and ethical 
restrictions) , digital datasets of gene expression are often characterised by a small number of biological 
replicates or no replicates at all. Although this complicates any effort to statistically analyse the data, 
it has led to inventive attempts at estimating as accurately as possible the biological variability in the 
data given very small samples. One approach is to assume a locally linear relationship between the 
variance and the mean in the Negative Binomial distribution, which allows estimating the variance by 
pooling together data from genes with similar expression levels ([I]). Alternatively, one can make the 
rather restrictive assumption that all genes share the same variance, in which case the over-dispersion 
parameter in the Negative Binomial distribution can be estimated from a very large set of datapoints 
f[T7]). A further elaboration of this approach is to assume a unique variance per gene and adopt a 
weighted-likelihood methodology for sharing information between genes, which allows for an improved 
estimation of the gene-specific over-dispersion parameters ([131). Another yet distinct empirical Bayes 
approach is implemented in the software haySeq, which adopts a form of information sharing between 
genes by assuming the same prior distribution among the parameters of samples demonstrating a large 
degree of similarity (|7j). 

In summary, proper statistical modelling and analysis of digital gene expression data requires the 
development of novel methods, which take into account both the discrete nature of this data and the 
typically small number (or even the absence) of biological replicates. The development of such methods 
is particularly urgent due to the huge amount of data being generated by high-throughput sequencing 
assays. In this paper, we present a method for modelling digital gene expression data that utilizes a novel 
form of information sharing between genes (based on non-parametric Bayesian clustering) to compensate 
for the all-too-common problem of low or no replication, which plagues most current analysis methods. 

2 Approach 

We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. 
Our point of departure is a hierarchical model for over-dispersed counts. The model is built around the 
Negative Binomial distribution, which depends, in our formulation, on two parameters: the mean and 
an over-dispersion parameter. We assume that these parameters are sampled from a Dirichlet process 
with a joint Inverse Gamma - Normal base distribution, which we have implemented using stick breaking 
priors. By construction, the model imposes a clustering effect on the data, where all genes in the same 
cluster are statistically described by a unique Negative Binomial distribution. This can be thought of as 
a form of information sharing between genes, which permits pooling together data from genes in the same 
cluster for improved estimation of the mean and over-dispersion parameters, thus bypassing the problem 
of little or no replication. We develop a blocked Gibbs sampling algorithm for estimating the posterior 
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Figure 1: Format of digital gene expression data. Rows correspond to genes and columns correspond to 
samples. Samples are grouped into classes (e.g. tissues or experimental conditions). Each clement of the 
data matrix is a whole number indicating the number of counts or reads corresponding to the i^^ gene 
at the j*^ sample. The sum of the reads across all genes in a sample is the depth or exposure of that 
sample. 



distributions of the various free parameters in the model. These include the mean and over-dispersion for 
each gene and the number of clusters (and their occupancies) , which does not need to be fixed a priori, 
as in alternative (parametric) clustering methods. In principle, the proposed method can be applied on 
various forms of digital gene expression data (including RNA-seq, CAGE, SAGE, Tag-seq, etc.) with 
little or no replication and it is actually applied on one such example dataset herein. 



3 Modelling over-dispersed count data 

The digital gene expression data we are considering is arranged in an AI x N matrix, where each of the 
N rows corresponds to a different gene and each of the M columns corresponds to a different sample. 
Furthermore, all samples are grouped in L different classes (i.e. tissues or experimental conditions). It 
holds that L < M, where the equality is true if there are no replicas in the data. 

We indicate the number of reads for the i*'* gene at the j*'' sample with the variable yij. We assume 
that Uij is Poisson-distributed with a gene- and sample-specific rate parameter r^j. The rate parameter 
rij is assumed random itself and it is modelled using a Gamma distribution with shape parameter Q!iA(j) 
and scale parameter Sij. The function A(-) in the subscript of the shape parameter maps the sample 
index j to an integer indicating the class this sample belongs to. Thus, for a particular gene and class, 
the shape of the Gamma distribution is the same for all samples. Under this setup, the rate can be 
integrated (or marginalised) out, which gives rise to the Negative Binomial distribution with parameters 
Q!iA(j) and fiij — ai\(j)Sij for the number of reads yij: 

I ^iyij + a^Mf)) ( a^x(3) ^l^x(j) 

r(aao))r(yjj -t- 1) V"iA(j) + J V"jao) + a^^j / 

where ^ij is the mean of the Negative Binomial distribution and ^^ij + oi^^j^lJ-ij is the variance. Since the 
variance is always larger than the mean by the quantity a~^j-^iifj, the Negative Binomial distribution 
can be thought of as a generalisation of the Poisson distribution, which accounts for over-dispersion. 
Furthermore, we model the mean as ^ij — c^e'^'-^tj) , where the offset Cj — X^i^i Vij is the depth or expo- 
sure of sample j and Pi\[j) is, similarly to aix^j), a gene- and class-specific parameter. This formulation 
ensures that ^ij is always positive, as it oughts to. 

Given the model above, the likelihood of observed reads Yn — {yij : A(j) = 1} for the i*'* gene in class 
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I is written as follows: 

p{Yu\au,l3u) = np(yy|ajA(j),^aO)) 

j 

= J|NegBinomial(yij|ajA(i),Cje''»Mj)) (2) 

j 

where the index j satisfies the condition A(j) = I. By extension, for the z*'' gene across all sample classes, 
the likelihood of observed counts Yi = {yij : X{j) — 1,1 — 1, L} is written as: 

p{Yi\aii,fiii,...,aiL,l3iL) = ]^p(Ka|ajz,/3i;) (3) 
where the class indicator I runs across all L classes. 
3.1 Information sharing between genes 

A common feature of digital gene expression data is the small number of biological replicates per class, 
which makes any attempt to estimate the gene- and class-specific parameters 6^ — {au^/Su} through 
standard likelihood methods a futile exercise. In order to make robust estimation of these parameters 
feasible, some form of information sharing between different genes is necessary. In the present context, 
information sharing between genes means that not all values of Ou are distinct; different genes (or the 
same gene across different sample classes) may share the same values for these parameters. This idea 
can be expressed formally by assuming that On is random with an infinite mixture of discrete random 
measures as its prior distribution: 

oo oo 

Oa-Y.'^''^0k' 0<Wfc<l, ^wfe^l (4) 

k=l fe=l 

where 5g' indicates a discrete random measure centered at 0^ = {a^,/?^} and Wk is the corresponding 
weight. Conceptually, the fact that the above summation goes to infinity expresses our lack of prior 
knowledge regarding the number of components that appear in the mixture, other than the obvious 
restriction that their maximum number cannot be larger than the number of genes times the number of 
sample classes. 

In this formulation, the parameters 9'^ are sampled from a prior base distribution Gq with hyper- 
parameters (j), i.e. 0^10 ~ Go(0). We assume that is distributed according to an inverse Gamma 
distribution with shape aa and scale Sq, while follows the Normal distribution with mean fip and 
variance cr|. Thus, Go is a joint distribution as follows: 

. . . > 

QfJ, I Oa, Sq,, /i^, cr^ ^ InvGamma(aQ, Sa) ■ Normal(/x^, (t|), fc = 1, 2, . . . (5) 

Given the above, can take only positive values, as it oughts to, while can take both positive and 
negative values. 

What makes the mixture in Eq. |4] special is the procedure for generating the infinite sequence of 
mixing weights. We set w\ = V\ and Wfe = 14 (1 ~ ^m) for fc > 2, where {Vl, . . . , 14} are random 

variables following the Beta distribution, i.e. 14- ~ Beta(afe, 6^)- This constructive way of sampling new 
mixing weights resembles a stick-breaking process; generating the first weight w\ corresponds to breaking 
a stick of length 1 at position 14 ; generating the second weight wi corresponds to breaking the remaining 
piece at position 14 and so on. Thus, we write: 

Wfc|afc,6fc ^ Stick(afc,6fc), /c = l,2,... (6) 

There are various ways for defining the parameters and h^. Here, we consider only the case where 
afc = 1 and — rj, with 77 > 0. This parametrisation is equivalent to setting the prior of On to a Dirichlet 
Process with base distribution Gq and concentration parameter 77. By construction, this procedure leads 
to a rapidly decreasing sequence of sampled weights, at a rate which depends on 77. For values of rj much 
smaller than 1, the weights Wk decrease rapidly with increasing fc, only one or few weights have significant 
mass and the parameters On share a single or a small number of different values 0^ . For values of the 
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Figure 2: The clustering effect that results from imposing a stick-breaking prior on the gene- and class- 
specific model parameters, On. A matrix of indicator variables is used to cluster the observed count data 
into a finite number of groups, where the genes in each group share the same model parameters. The 
number of clusters is not known a priori. The distribution of weight mass among the various clusters in 
the model is determined by parameter rj. 



concentration parameter much larger than 1, the weights Wk decrease slowly with increasing k, many 
weights have significant mass and the values of On tend to be all distinct to each other and distributed 
according to Gq. Below, we set rj — I, which results in a balanced decrease of the weight mass with 
increasing k. In particular, for rj — 1, log{'Wk) decreases (on average) in an unbiased manner with 
increasing k. 

Given the above formulation, sampling On from its prior distribution is straightforward. First, we 
introduce an indicator variable zu S {1,2, . . .}, which points to the value of 6^ corresponding to the 
j^th ggjj^g jjj class I. We sample such indicator variables for each gene in each class from the Categorical 
distribution, i.e. zu ~ Categorical(wi, u;2, ■ • ^^'^ set On = 0*.^. Although Gq is continuous, the 
distribution of On is almost surely discrete and, therefore, its values are not all distinct. Different genes 
may share the same value of 9* and, thus, all genes are grouped in a finite (unknown) number of clusters, 
according to the value of 9^ they share. Modelling digital gene expression data using this approach is one 
way to bypass the problem of few (or the absence of) technical replicates, since the data from all genes 
in the same cluster are pooled together for estimating the parameters that characterise this cluster. The 
clustering effect described in this section is illustrated in Fig. [2] 

3.2 Generative model 

The description in the previous paragraphs suggests a hierarchical model, which presumably underlies 
the stochastic generation of the data matrix in Fig. [T] This model is explicitly described below: 

9k\0'a, Sa, fJ-p, cr'p InvGamma(aQ, Sq) • Normal(/i^, (T^) 

wi,W2,---\ri ~ Stick(l,7/) 
ZiX(j)\wi,W2, ■ . . Categorical(wi, W2, . . .) 

yij\Oz\{j) ^ NegBinomial (6'ja(j)) (7) 

At the bottom of the hierarchy, we identify the measured reads yij for each gene in each sample, which 
follow a Negative Binomial distribution with parameters ^iA(j) = The parameters of the 

Negative Binomial distribution ^iA(j) ^re gene- and class-specific and they are completely determined 
by an also gene- and class-specific indicator variable 2iA(j) ^-nd the centers O^. of the infinite mixture of 
point measures in Eq. [4] These centers are distributed according to a joint inverse Gamma and Normal 
distribution with hyper-parameters (f) — {oq,, Sa, ^p, while the indicator variables are sampled from 
a Categorical distribution with weights {wi, W2, . . .}• These are, in turn, sampled from a stick-breaking 
process with concentration parameter rj. In this model, ^, w^, 0^ and Zix(j^ are latent variables, which 
are subject to estimation based on the observed data. 
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4 Inference 



At this point, we introduce some further notation. We indicate the N x L matrix of indicator variables 
with the letter Z; 8* = {61,92, ■ ■ ■} lists the centers of the point measures in Eq. |4]and W = {wi, W2, ■ ■ ■} 
is the vector of mixing weights. 

We are interested in computing the joint posterior density p{Z,W,Q* ,(j)\Y), where y is a matrix 
of count data as in Fig. [l] We approximate the above distribution through numerical (Monte Carlo) 
methods, i.e. by sampling a large number of {9*, W, Z, 0}-tuples from it. One way to achieve this is by 
constructing a Markov chain, which admits p{Z, W, 9* , 01^) as its stationary distribution. Such a Markov 
chain can be constructed by using Gibbs sampling, which consists of alternating repeated sampling from 
the fuh conditional posteriors p{&*\Y, Z, (t>), p{W\Z), p{Z\Y, Q* ,W) and p{4>\Q*,Z). Below, we explain 
how to sample from each of these conditional distributions. 

Sampling from the conditional posterior p{<d*\Y, Z,(j)) 

In order to sample from the above distribution it is convenient to truncate the infinite mixture in Eq. |4]by 
rejecting all terms with index larger than K and setting wk — ^fci which is equivalent to setting 

Vk — 1- It has been shown that the error associated with this approximation when Vk ~ Beta(l,?7) is 
less than or equal to 4A^Mexp(-^^) (0). For example, for TV = 14 x 10^, M = 6, if = 200 and = 1, 
the error is minimal (less than 10~®°). Thus, the truncation should be virtually indistinguishable from 
the full (infinite) mixture. 

Next, we distinguish between Kac active clusters (9*^) and Kin inactive clusters (9*„), such that 
9* = {9*c, 9*„} and K = Kac + K^. Active clusters are those containing at least one gene, while those 
containing no genes are considered inactive. We write: 



p{Q*\Y,Z,c^) = p{Ql,,QUY,Z,4,) 

= p(9:,|r,Z,^)p(9,:„|0) 

Updating the inactive clusters is a simple matter of sampling Kin times from the joint distribution in Eq. 
[5] given the hyper-parameters 0. Sampling the active clusters is more complicated and involves sampling 
each active cluster center 0*^, ^ individually from its respective posterior, p(()*ac k\^a.c,k) , where Yac.k is a 
matrix of measured count data for all genes in the fc*'' active cluster. Sampling 0*^ j. = {a*^ k^Pac fcl 
done using the Metropolis algorithm with acceptance probability: 

where the superscript + indicates a candidate vector of parameters. Each of the two elements {a and /3) 
of this vector is drawn from a symmetric proposal of the following form: 



q(x+|x*) =a;*exp(0.01-r) (9) 

where the random number r is sampled from the standard Normal distribution, i.e. r ^ Normal(0, 1). 
The prior of p{0*^ f.\(l}) is a joint Inverse Gamma - Normal distribution, as shown in Eq. j5j while the 
likelihood function p{Yac,k\0ac k) ^ product of Negative Binomial probability distributions, similar to 
those in Eqs. [2] and [3] 



Sampling from the conditional posterior p{Z\Y, 9*, W) 

Each element zn of the matrix of indicator variables Z is sampled from a Categorical distribution with 
weights TT,i = {ttI , . . . , Trf }, where 4 = Y.l^i and: 

{nil, {wip{Yu\ei), . . . ,WKP{Yu\e*K)} (lo) 

In the above expression, Yn is the data for the i^^ gene in class I, as mentioned in a previous section. 
Notice that zu can take any integer value between 1 and K and that the weights -ku depend both on the 
cluster weights Wk and on the value of the likelihood function p{Yii\0'^). 
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Sampling from the conditional posterior p{W\Z) 



The mixing weights W are generated using a truncated stick-breaking process with -q — 1. As pointed out 
in [S], this impUes that W follows a generalised Dirichlet distribution. Considering the conjugacy between 
this and the multinomial distribution, the first step in updating W is to generate K —1 Beta-distributed 
random numbers: 

k 

Vk ^ Beta(l + Nk.,v + N-J2^rn) (11) 

m— 1 

for k = 1, . . . ,K — 1, where iVj. is the total number of genes in the fc*'' cluster. Notice that Nf^ can be 
inferred from Z by simple counting and X]m=i — N, where N is the total number of genes. Vk is 
set equal to 1, in order to ensure that the weights add up to 1. These are simply generated by setting 
Vi — wi and Wk = Vk Y[m=ii^ ~ Kn), as mentioned in a previous section. 

Sampling from the conditional posterior p{(j)\Q*,Z) 

The hyper-parameters <p — {oq, Sq, /i^, ct|} influence indirectly the observations Y through their effect 
on the distribution of the active cluster centers, 0*^, = {alcTl^ac}' where a*^ = • ■ • , a*^ ^^^} 

and = {f3lc,n ■ ■ ■ ^Plc k^^}- we further assume independence between a*^ and we can write 
p(^|e*,Z) =p(a„,s„,^^,cr^K^,^*J =p(a„,s„KJp(M;3,CT||/3*J. 

Assuming Kac active clusters and considering that the prior for a* is an Inverse Gamma distribution 
(see Eq. [5|, it follows that the posterior p{aa, Sq|q!*c) i^- 

f I * ^ 7i°~'exp(-g^72)g°°T^ 
p{aa,Sc\a^^) oc f(7y; (^2) 

The parameters 71 to 74 are given by the following expressions: 



71 = 7i 



(0) 



Ka 



n 



a 

k=l 



k 



(0) 1 ^ 

72 = i2+l^-r 



73 = 7f +ifac 

74 = 74°'' + 

where the initial parameters 'yf\ 72°\ 73°' and 74^^' are all positive. Since sampling from Eq. 
be done exactly, we employ a Metropolis algorithm with acceptance probability 



12 



cannot 



Pace = m^n I 1, ) (13) 



p{aa,Sa\al 

where the proposal distribution q{-\-) for sampling new candidate points has the same form as in Eq. [9j 

Furthermore, taking advantage of the conjugacy between a Normal likelihood and a Normal-InverseGamma 
prior, the posterior probability for parameters /i^ and becomes: 

pifip, crp\f3*^) = NormalInverseGamma(5i, (52, S^, S4) (14) 
The parameters 61 to (given initial parameters (5^°' to S^f"') are as follows: 



(52 = 6f^+Kac 

h=1 0^ -\- Kac 



7 



where (3*^ — Pac k- Sampling a {/i^g, CT|}-pair from the above posterior takes place in two simple 

steps: first, we sample ct| ~ InverseGamma(<53, 54), where S3 and ^4 are shape and scale parameters, 
respectively. Then, we sample Normal((5i, (7^/(52). 

4.1 Algorithm 

We summarise the algorithm for drawing samples from the posterior p{Q* , Z, W, (j>\Y) below. Notice that 
x*^*-* indicates the value of x at the i*^ iteration of the algorithm, x^^'^ is the initial value of x. 

1. Set7(°) = {7r,7f,7f,7f} 

2. Set<5(«) = {5f),4°\<5f ,<5f } 

3. Set0(o)={ai°\6("\Mi''\^r} 

4. Set K, the truncation level 

5. Sample 0*(*') from its prior (Eq. ^ conditional on f/)'^"^ 

6. Set all K elements of VF^^-' to the same value, i.e. 1/K 

7. Sample Z'"-* from the Categorical distribution with weig hts W'-°^ 

8. Fort = l,2,3, ...,T 

(a) Sample Oac*'' given (/)(*~^) and the data matrix Y using a single step of the Metropolis 
algorithm for each active cluster (see Eq. |8]) 

(b) Sample Q*^^ from its prior given 0^*^^) (see Eq. ^ 

(c) Sample Z'*) given e*(*\ W'-^~'^^ and the data matrix Y (see Eq. [lOl 

(d) Sample W'^*^ given (see Eq. [u]) 

(e) Sample 0'^*^ given 8al*^ and (/)(*~-^) (see Eqs. 12 and 14 1 



9. Discard the first Tq samples, which are produced during the burn-in period of the algorithm (i.e. 
before equilibrium is attained), and work with the remaining T — Tq samples. 

The above procedure implements a form of blocked Gibbs sampling with embedded Metropolis steps for 
impossible to directly sample from distributions. 



5 Results and Discussion 

We have implemented the methodology described in the preceding sections in software and we have 
applied this software on publicly available digital gene expression data (obtained from control and can- 
cerous tissue cultures of neural stem cells; [B]) for evaluation purposes. The data we used in this study 
can be found at the following URL : [http : / / g enomebiology. com/content / supplementary / gb- 20 1 0- 11- Ifr] 
£l06-s3.tgz. As shown in Table [ij this dataset consists of four libraries from glioblastoma-derived neural 
stem cells and two from non-cancerous neural stem cells. Each tissue culture was derived from a different 
subject. Thus, the samples are divided in two classes (cancerous and non-cancerous) with four and two 
replicates, respectively. 

We implemented the algorithm presented above in the programming language Python, using the 
libraries NumPy, SciPy and MatplotLib. Calculations were expressed as operations between arrays and 
the multiprocessing Python module was utilised in order to take full advantage of the parallel architecture 
of modern multicore processors. The algorithm was run for 200K iterations, which took approximately 
two days to complete on a 12-core desktop computer. Simulation results were saved to the disk every 50 
iterations. 

The raw simulation output includes chains of random values of the hyper-parameters 0, the gene- 
and class-specific indicators Z and the active cluster centers 6aci which constitute an approximation to 
the corresponding posterior distributions given the data matrix Y. The chains corresponding to the four 
different components of <f) — {cia, Sa, fi/3,a''i} are illustrated in Figure pi It may be observed that these 
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Table 1: Format of the data by [6]. The first four samples are from ghoblastoma neural stem cells, while 
the last two are from non-cancerous neural stem cells. The dataset contains a total of 18760 genes (i.e. 
rows) . 
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Figure 3: Simulation results after 200K iterations. The chains of random samples correspond to the 
components of the vector of hyper-parameters 4>, i.e. /i^ and (t| (panel A) and and Sq, (panel B). The 
former determines the Normal prior distribution of the cluster center parameters /?*, while the latter 
pair determines the Inverse Gamma prior distribution of the cluster center parameters a* . The random 
samples in each chain are approximately sampled (and constitute an approximation of) the corresponding 
posterior distribution conditional on the data matrix Y . 

reached equilibrium early during the simulation (after less than 20K iterations) and they remained stable 
for the remaining of the simulation. As explained earlier, these hyper-parameters are important, because 
they determine the prior distributions of the cluster centers a* and j3* (hyper-parameters {ac„Sa} and 
{^^,ct|}, respectively) and, subsequently, of the gene- and class-specific parameters a and /3. 

It follows from analysis of the chains in Figure [3] that the estimates for these hyper-parameters are 
(indicating the mean and standard deviation of the estimates): Oq, = 0.83 ± 0.13, Sq = 1.00 ± 0.16, 
//^ ~ —10.01 ± 0.39 and cr^ — 5.41 ± 1.32. The corresponding Inverse Gamma and Normal distributions, 
which are the priors of the cluster centers a* and /?*, respectively, are illustrated in Figure |4] 

A major use of the methodology presented above is that it allows us to estimate the gene- and 
class-specific parameters a and /3, under the assumption that the same values for these parameters are 
shared between different genes or even by the same gene among different sample classes. This form of 
information sharing permits pulling together data from different genes and classes for estimating pairs 
of a and (3 parameters in a robust way, even when only a small number of replicates (or no replicates at 
all) are available per sample class. As an example, in Figurep^we illustrate the chains of random samples 
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Figure 4: Estimated Inverse Gamma (panel A) and Normal (panel B) prior distributions for the cluster 
parameters a* and /?*, respectively. The solid lines indicate mean distributions, i.e. those obtained for 
the mean values of the hyper-parameters Oq, Sa, Hp and cr^. The dashed lines are distributions obtained 
by adding or subtracting individually one standard deviation from each relevant hyper-parameter. 

for a and /? corresponding to the non-cancerous class of samples for the tag with ID 182-FIP (third row 
in Table [T]). These samples constitute approximations of the posterior distributions of the corresponding 
parameters. Despite the very small number of replicates (n = 4), the variance of the random samples is 
finite. Similar chains were derived for each gene in the dataset, although it should be emphasised that 
the number of such estimates is smaller than the total number of genes, since more than one genes share 
the same parameter estimates. 

It has already been mentioned that the sharing of a and /3 parameter values between different genes 
can be viewed as a form of clustering (see Figure [2]), i.e. there are different groups of genes, where 
all genes in a particular group share the same a and j3 parameter values. As expected in a Bayesian 
inference framework, the number of clusters is not constant, but it is itself a random variable, which 
is characterised by its own posterior distribution and its value fluctuates randomly from one iteration 
to the next. In Figure [6j we illustrate the chain of sampled cluster numbers during the course of the 
simulation (panel A). The first 75K iterations were discarded as burn-in and the remaining samples 
were used for plotting the histogram in panel B, which approximates the posterior distribution of the 
number of clusters given the data matrix Y . It may be observed that the number of clusters fluctuates 
between 35 and 55 with a peak at around 42 clusters. The algorithm we present above does not make any 
particular assumptions regarding the number of clusters, apart from the obvious one that this number 
cannot exceed the number of genes times the number of sample libraries. Although the truncation level 
K = 200 sets an artificial limit in the maximum number of clusters, this is never a problem in practise, 
since the actual estimated number of clusters is typically much smaller that the truncation level K (see 
the y-axis in Figure [6]A). The fact that the number of clusters is not decided a priori, but rather inferred 
along with the other free parameters in the model sets the described methodology in an advantageous 
position with respect to alternative clustering algorithms, which require deciding the number of clusters 
at the beginning of the simulation (pi). 

Similarly to the stochastic fluctuation in the number of clusters, the cluster occupancies (i.e. the 
number of genes per cluster) is a random vector. In Figure [Tj we illustrate the cluster occupancies at 
two different stages of the simulation, i.e. after lOOK and 200K iterations, respectively. We may observe 
that, with the exception of a single super-cluster (containing more than 6000 genes), cluster occupancies 
range from between around 3000 and less than 1000 genes. 
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Figure 5: Chains of random samples approximating the posterior distributions of the parameters a (panel 
A) and /3 (panel B) corresponding to the non-cancerous class of samples for the tag with ID 182-FIP 
(third row in Table [T]). These samples were generated after 200K iterations of the algorithm. A similar 
pair of chains exists for each gene at each sample class (i.e. cancerous and non-cancerous), although not 
all pairs are distinct to each other due to the clustering effect imposed on the data by the algorithm. 



It should be clarified that each cluster includes many (potentially, hundreds of) genes and it may 
span several classes. An individual cluster represents a Negative Binomial distribution (with concrete a 
and /3 parameters), which models with high probability the count data from all its member genes. This 
is illustrated in Figure [8) where we show the histogram of the log of the count data from the first sample 
(sample GliNSl in Table [T]) along with a subset of the estimated clusters after 200K iterations (gray 
lines) and the fitted model (red line). It may be observed that each cluster models a subset of the gene 
expression data in the particular sample. The complete model describing the whole sample is a weighted 
sum of the individual clusters/Negative Binomial distributions. Formally, 

1 ^ 

i=l 

where Yj is the j*'* sample and the index i runs over all N genes. We repeat that not all {aiA(j): AaO)} 
pairs are distinct. Also, clusters with larger membership (i.e. including a larger number of genes) have 
larger weight in determining the overall model. 

The proposed methodology provides a compact way to model each sample in a digital gene expression 
dataset following a two-step procedure: first, the dataset is partitioned into a finite number of clusters, 
where each cluster represents a Negative Binomial distribution (modelling a subset of the data) and the 
parameters of each such distribution are estimated. Subsequently, each sample in the dataset can be 
modelled as a weighted sum of Negative Binomial distributions. In Figure |9j we show the log of count 
data for each sample in the dataset shown in Table [l] along with the fitted models (red lines) after 200K 
iterations of the algorithm. 



6 Conclusion 

Next-generation sequencing technologies are routinely being used for generating huge volumes of gene 
expression data in a relatively short time. This data is fundamentally discrete in nature and their 
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Figure 6: Stochastic evolution of the number of clusters during 200K iterations of the simulation (panel 
A) and the resulting histogram after discarding the first 75K iterations as burn- in (panel B). After 
reaching equilibrium, the number of clusters fluctuates around a mean of approximately 43 clusters. In 
general, the estimated number of clusters is much smaller than the truncation level {K = 200, see y-axis 
in panel A) . The histogram in panel B approximates the posterior distribution of the number of clusters 
given the data matrix Y. 




Figure 7: Cluster occupancies after lOOK and 200K iterations of the algorithm. A single super-cluster 
(including more then 6000 genes) appears at both stages of the simulation. The occupancy of the remain- 
ing clusters demonstrates some variability during the course of the simulation, with clusters containing 
between 3000 and less than 1000 genes. 
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Figure 8: Histogram of the log of the number of reads from sample GliNSl, a subset of the estimated 
clusters (gray lines) and the estimated model of the sample at the end of the simulation. Each cluster 
(gray line) represents a Negative Binomial distribution with specific a and /3 parameters, which models 
a subset of the count data in this particular sample. The complete model (red line) is the weighted sum 
of all component clusters. 
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Figure 9: Histograms of the log of the number of reads from cancerous (panels Ai-iv) and non-cancerous 
(panels Bi,ii) samples and the respective estimated models after 200K iterations of the algorithm. As 
already mentioned, each red line is the weighted sum of many component Negative Binomial distributions 
/ clusters, which model different subsets of each data sample. We may observe that the estimated models 
fit tightly the corresponding data samples. 
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analysis requires the development of novel statistical methods, rather than modifying existing tests that 
were originally aimed at the analysis of microarrays. The development of such methods is an active area 
of research and several papers have been published on the subject (see [15] and [1^ for an overview). 

In this paper, we present a novel approach for modelling over-dispersed count data of gene expression 
(i.e. data with variance larger than the mean predicted by the Poisson distribution) using a hierarchical 
model based on the Negative Binomial distribution. The novel aspect of our approach is the use of 
a Dirichlet process in the form of stick breaking priors for modelling the parameters (mean and over- 
dispersion) of the Negative Binomial distribution. By construction, this formulation forces clustering 
of the count data, where genes in the same cluster are sampled from the same Negative Binomial 
distribution, with a common pair of mean and over-dispersion parameters. Through this elegant form 
of information sharing between genes, we compensate for the problem of little or no replication, which 
often restricts the analysis of digital gene expression datasets. We have demonstrated the ability of this 
approach to model accurately actual biological data by applying the proposed methodology on a publicly 
available dataset obtained from cancerous and non-cancerous cultured neural stem cells ([6]). 

We show that inference is achieved in the proposed model through the application of a blocked Gibbs 
sampler, which includes estimating, among others, the gene- and class-specific mean and over-dispersion 
of the Negative Binomial distribution. Similarly, the number of clusters and their occupancies are inferred 
along with the rest free parameters in the model. 

Currently, the software implementing the proposed method remains relatively computationally ex- 
pensive. In particular, 200K iterations require approximately two days to complete on a 12-core desktop 
computer. This time scale is not disproportionate to the production time of experimental data and it is 
mainly due to the high volume of the tested data (> 15K genes per sample) and the need to obtain long 
chains of samples for a more accurate estimation of posterior distributions. Long execution times are a 
characteristic, more generally, of all Monte Carlo approximation methods. Our implementation of the 
algorithm is completely parallelised and calculations are expressed as operations between vectors in order 
to take full advantage of modern multi-core computers. Ongoing work towards reducing execution times 
aims at the application of variational inference methods (|3]), instead of the blocked Gibbs sampler we 
currently use. The algorithm can be further improved by avoiding truncation of the infinite summation 
described in Equation |4] as described in [16] and in [23] . 

This non-parametric Bayesian approach for modelling count data has thus shown great promise in 
handling over-dispersion and the all-too-common problem of low replication, both in theoretical evalu- 
ation and on the example dataset. The software that has been produced will be of great utility for the 
study of digital gene expression data and the statistical theory will contribute to leading the development 
of non-parametric methods in general for all forms of modelling count data of gene expression. 
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